Finding common protein interaction patterns across organisms

Protein interactions are an important resource to obtain an understanding of cell function. Recently, researchers have compared networks of interactions in order to understand network evolution. While current methods first infer homologs and then compare topologies, we here present a method which first searches for interesting topologies and then looks for homologs. PINA (protein interaction network analysis) takes the protein interaction networks of two organisms, scans both networks for subnetworks deemed interesting, and then tries to find orthologs among the interesting subnetworks. The application is very fast because orthology investigations are restricted to subnetworks like hubs and clusters that fulfill certain criteria regarding neighborhood and connectivity. Finally, the hubs or clusters found to be related can be visualized and analyzed according to protein annotation.


Introduction
Protein interactions play an important role in many cellular processes such as signaling, transcription regulation and multi-enzyme complexes. Interactions can be very strong as between coiled coils of myosin (Lupas et al 1991). They can be transient, as in case of bHLH (basic Helix-Loop-Helix) proteins (Bornberg-Bauer et al 1998, Amoutzias et al 2004. They can also be very unspecifi c, such as the ones mediated by the SH3 domain (Reiss and Schwikowski, 2004).
Over the last years many groups have studied network structures, mostly focusing on "global" features of interaction networks. Protein networks arrange in huge connected components. These have a few highly linked nodes and many sparsely linked ones. The average path from any given node to any other is short and this relationship has been termed small world behavior in analogy to social networks of mutual acquaintances. Models based on physical concepts have also been used to characterize global properties. For example, many networks feature a hub-like arrangement that often coincides with scalefree behavior ( Barabasi and Oltvai, 2004).
Several groups analyzed global properties of interaction networks with the goal to answer phylogenetic questions. The endosymbiotic hypothesis was confi rmed by identifi cation and analysis of the most ancient interactions (Qin et al 2003). Conservation of protein interactions across phyla has been observed in many cases (Matthews et al 2001;Ramani and Marcotte 2003;Bork et al 2004), for example for the transcriptional network which regulates the development of the heart. It is regulated by protein interactions which have been conserved at least since the last common ancestor of fl y and man (Cripps and Olson, 2002).
In a very recent strand of research, several groups have begun to systematically compare interaction networks between organisms, and of the network of one organism with itself (Matthews et al 2001;Kelley et al 2003). In the fi rst case, orthologous subnetworks are inferred. By analogy with sequenceonly analyses such as phylogeny reconstruction, "paralogous" subnetworks can be detected in the second case. The latter result from subnetwork duplications in a single species. In particular, the PathBlast tool can detect homology between linear network substructures by aligning "pathways" of prespecifi ed length between two networks, matching interacting proteins that are similar according to BLAST (Basic local alignment search tool, Altschul et al 1997) and allowing a limited amount of mismatches and gaps. In other words, PathBlast is based on the pairwise alignment of symbols representing interacting proteins deemed similar. More recently, PathBlast has been extended to work for more than two networks simultaneously, and for non-linear substructures (Sharan et al 2005). The latter extension is based on graph theory, and it involves a large amount of similarity searches. We were interested in a lightweight approach to fi nding orthologous substructures shared by two interaction networks.
Based on orthologous subnetworks, the function of some of the interacting proteins may be predicted, thus extending the well known "homology implies functional analogy" paradigm (Benner et al 2000, Fuellen et al 2005. On the other hand, Yu et al (2004) demonstrated that interactions in one organism can be predicted to occur in another if the corresponding orthologs can be found (see also Huang et al 2004;Brown and Jurisica 2005). For such predictions, orthologous subnetworks are particularly useful. Last not least, the analysis of orthologous subnetworks provides insights into evolution. For example, by comparing the networks of two species, their ancestral "core" network can be estimated.
Here we present a method for the identifi cation of orthologous subnetworks. Their identifi cation is computationally demanding if we base the analysis on an all-against-all comparison. Therefore, as described in the fi rst part of the paper, we calculate the clustering coeffi cient for all nodes in the fi rst network, identify the nodes with an interesting connectivity pattern such as hubs and clusters, and only search for putative ortholog matches of these nodes in the second network. The second part of the paper describes several case studies, demonstrating the use of our application.

Material and Methods
We used the protein interaction networks of Homo sapiens (homo), Mus musculus (mouse), Drosophila melanogaster (fl y) and Saccharomyces cerevisiae (yeast) taken from the BIND (Biomolecular Interaction Network) database (Alfarano et al 2005) as of January/February 2005. Given two networks, the clustering coeffi cient of each node i in a network is calculated according to the standard formula (see, eg Barabasi and Oltvai, 2004) where k i is the number of neighbors of node i and n i represents the number of connections of the neighbors of node i among themselves. The determination of n i is done in a straightforward way by comparing the list of neighbors of node i with the lists of neighbors of its neighbors, incrementing the value of n i for each match between these lists. We note that our naïve implementation only consumes a negligible fraction (much less than 1%) compared to the homology searches that follow (see below). Therefore, we did not consider any sophisticated approach (as described in, eg, Schank and Wagner, 2004) to calculating the clustering coeffi cient.
Given the clustering coeffi cient of all nodes in the networks to be compared, hubs and clusters are identifi ed in the fi rst network. We defi ne a hub as a node in the protein interaction network, which has a low clustering coefficient (per default, its value is required to be below the hub threshold η = 0.1) and many neighbors (more than four by default). We defi ne a cluster as a set of interacting nodes in the protein interaction network, where all nodes have a high clustering coeffi cient (per default, the corresponding cluster threshold γ is set to 0.1). Thus, to defi ne a cluster, a breadth-fi rst search is done until the clustering coeffi cient drops below γ, where the result of the search does not depend on its starting point. All thresholds were determined empirically. Only for the hubs and clusters of the fi rst network, Smith-Waterman (1981) alignments are calculated to fi nd putative orthologous proteins and, therefore, putative orthologous subnetworks in the second network. More specifi cally, an all-against-all comparison is performed for each hub of the fi rst network with each hub of the second network. For the best hub-hub matches, the peripheral proteins are compared as well, in an all-againstall fashion. The two hubs are visualized together with the peripheral proteins as two subnetworks, where edges denote the interactions. All comparisons that reveal above-threshold similarity are given a link (colored red in the visualization) between nodes of the two subnetworks. In case of a cluster (in the fi rst network), the protein with the highest clustering coeffi cient is compared to all proteins involved in clusters of the second network. Again, for the best matches, an allagainst-all comparison is performed, comparing the proteins forming the cluster in the fi rst network, with the proteins forming the cluster in the second network, yielding the similarity links. We do not use gapped BLAST (Altschul et al 1997), because the cost of calling BLAST externally more than outweighs the gain compared to using Smith-Waterman. The resulting similarity scores are normalized, by dividing them by the selfmatch similarity score of the protein used for the search. These scores are preferred to E-values or p-values, since they can be used directly as color intensities for the similarity links of our visualization (see Figure 1). For each subnetwork presented, the reciprocal search was also conducted, ie we used the putative ortholog to search the fi rst network. If this search fi nds the hub or cluster that we started with, resulting in reciprocal best hits, orthology is a reasonable hypothesis (cf., eg, Remm et al 2001). If the reciprocal search fi nds different hits, it is possible that paralogy (duplication and subsequent speciation) is the correct hypothesis. In the current implementation such cases are not considered further.
All code is written in JAVA and provided as open source; it is available at http://www. uni-muenster.de/bioinformatics/services. PINA provides a GUI as well as command-line support. The "prefuse" toolkit (Heer et al 2005) was integrated to provide network visualization. A screenshot of our application is given in Figure 1. It should be noted that no similarity analysis can provide definite statements on orthology. To begin with, "looking back in time" is dependent on models of sequence evolution, which may be incorrect. Further, it is possible that proteins were duplicated and differentially lost after subsequent speciation events, a case known as hidden paralogy (Martin and Burg, 2002). However, hidden paralogy is unlikely to occur in parallel for many proteins, so that structures of interacting proteins are less easily mistaken as orthologs if they are not. Figure 1: A screenshot of PINA. On the left/right, proteins of the fi rst/second network are listed, including the clustering coeffi cient, and a symbol (H or C), denoting a hub or a member of a cluster, respectively. Description lines for each protein can be inspected in the logging panel in the middle, which also provides network statistics. The subnetworks corresponding to the highlighted proteins (number 24 and number 11) can be visualized and analyzed for similarity. Similarity analysis can also be done comparing the hubs and clusters of an entire network against the other network.

Results and Discussion
We applied our orthologous subnetwork search to various networks from the BIND database, recovering observations that can be found in the literature. Three examples will be described and discussed in more detail; they were selected among all hub-hub and cluster-cluster similarity matches calculated by PINA, based on the criteria of high similarity scores and biological relevance. Figure 2 shows two subnetworks which are orthologous between Homo sapiens and Mus musculus and were found by PINA. Starting with the hub protein hREV1 (Homo), Rev1 (Mus) was found to match using Smith-Waterman. Moreover, each protein of the human subnetwork has a matching protein in the murine subnetwork; all proteins except hREV7/Rev7 are known to belong to the Y family of DNA polymerases; Rev7 belongs to the B subfamily. All these polymerases are translesion DNA synthases (TLS). Our fi ndings confi rm observations by Ohashi el al 2004, and by Guo et al 2003, both based on yeast two-hybrid assays. This is no surprise since the underlying BIND interaction data are based on these two papers. However, given the functional analysis by Guo et al and the interaction data obtained experimentally by Ohashi et al orthology, of subnetworks detected by PINA allows predicting some functional characterizations reported by Ohashi et al that were obtained before by Guo et al In both cases, Pol kappa, Pol eta and Pol iota as well as (h)Rev7 interact with the C-terminal portion of (h)Rev1, which is involved in mediating protein-protein interactions among DNA polymerases required for TLS, and the situation in human can be predicted from the one in mouse, and vice versa. Regarding the evolution of the network, we note that the Rev/Pol proteins are more similar across species than they are within a single species (see Figure 2 and Table 1), so the most parsimonious hypothesis is that the entire network already existed in the common ancestor of mouse and homo. This way, PINA has identifi ed an evolutionary conserved subnetwork in a fully automated fashion. Kannouche et al (2003) have shown that Pol iota and Pol eta can form (part of) a protein complex. This protein interaction is not yet included in the BIND database, and it can be predicted to exist in mouse as well. We propose that predictions of further protein interactions for already existing orthologous subnetworks are more reliable than predictions without this additional information, and we intent to use PINA for such predictions,  In Figure 3, three further orthologous subnetworks are described. The TLE1/Groucho networks (panel A) are not matching one-to-one as the Rev/ Pol networks do. Only some proteins feature high similarity with another protein. The two hubs, Groucho and TLE1 (transducin-like enhancer protein 1), match with a high similarity score, and they are the basis of calling both subnetworks orthologous. They both play a distinct role as transcriptional repressors of a variety of other proteins involved in transcriptional regulation (Zhu et al 2002, Lopez-Rios et al 2003. Optix / Sine    Oculis (SO) in fl y and the human Six family are transcription factors featuring a Homeo and a Six domain; they interact with Groucho and TLE1, respectively. PINA results can be used for a functional analysis of these orthologous subnetworks, as it was done before in case of the Rev/Pol networks, and similar opportunities exist for the murine system (Zhu et al 2002). Most interestingly, however, the orthologous subnetworks also confi rm hypotheses about the evolution of these networks. Similarity is strongest between Six3 and Six6, as well as between Six1 and Six2 (see Table  2), indicating recent duplications. Furthermore, Table 3 indicates that, as can already be inferred from the similarity edges in Fig. 3, Six3/6 and Optix can be assumed to have a common ancestor, just like Six 1/2 and SO. This assumption is confi rmed by Gallardo et al (1999) and it is corroborated by the phylogenetic tree analysis in Fig. 4, except that our analysis places Optix next to the root of the tree. Finally, Six3/6, Optix and Six1/2/ SO are probably the result of a (series of) duplications, inheriting the interaction with Groucho/ TLE. In such a way, using PINA results, we can estimate the evolution of the "core" network that goes back to the ancestor of Drosophila and Homo.
Another example from the comparison of Homo and Drosophila is shown in Fig 3, panel B. In this case, the human Myc-Max-Mad transcription factor network (Luscher 2001;Partlin et al 2003;Nair and Burley, 2003;Amoutzias et al 2004) is compared to its Drosophila counterpart (Giot et al 2003). Featuring just three interactions recorded in the BIND database, the Drosophila Max protein can barely be identifi ed as a hub. Furthermore, it is surprising that it is the human Mad protein (and not the human Max protein) that has strongest similarity links with the Drosophila Max (and Mnt) proteins. However, similarity may be misleading as far as evolutionary relationship is concerned (Koski and Golding, 2001), and a sequence-based phylogenetic tree indeed confi rms the putative correct evolutionary relationship, placing the Myc and the Max proteins in one subtree each, irrespective of the species they come from (data not shown).
Our fi nal example ( Figure. 3, panel C) describes orthologous subnetworks based on matching clusters by PINA. Gtr1p and Gtr2p from yeast as well as RagA and RagC from human are nuclear G proteins (Nakashima et al 1999;Sekiguchi et al 2001). G proteins are regulatory GTP hydrolases, which function as molecular switches and are involved in cell development. Rag proteins are known to feature homology with Gtr proteins, and to interact with each other, two observations that are also found by PINA. Additionally, our orthologous subnetworks display them together with two uncharacterized proteins. As soon as any one of these is studied in detail, we expect to learn about the other as well.
Naturally, due to our restriction to "interesting" network topologies (hubs, clusters), our method cannot normally fi nd all orthologous subnetworks. In brief, there are two extreme cases: On one hand, if the network clusters in one giant well-connected component, so that we have only one cluster and no hubs, then our approach will just do an allagainst-all comparison, and fi nd the "usual" putative orthologs. On the other hand, if the network is very disconnected, our approach yields very distinctive results, focusing on hubs as well as smaller well-connected clusters as candidates for orthologous subnetworks.

Conclusions
The results presented here show how a scheme fi rst looking for the topology of the protein-interaction network and then doing homology searches for the proteins involved can enhance our view on network evolution. Demonstrating the potential of comparative protein interaction network analysis (phyloproteomics), investigations of protein function and prediction of further interactions can be based on the orthologous subnetworks that we are able to identify using PINA. Future work includes comparative analyses across more than two species, and a formal approach towards reconstructing the evolution of interaction networks based on such data.